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Abstract 

Motivation: High-throughput sequencing enables expression analysis at the level of individual 
transcripts. The analysis of transcriptomc expression levels and differential expression estimation re- 
quires a probabilistic approach to properly account for ambiguity caused by shared exons and finite 
read sampling as well as the intrinsic biological variance of transcript expression. 

Results: We present BitScq (Baycsian Inference of Transcripts from Sequencing data), a Bayesian 
approach for estimation of transcript expression level from RNA-seq experiments. Inferred relative 
expression is represented by Markov chain Monte Carlo (MCMC) samples from the posterior probability 
distribution of a generative model of the read data. We propose a novel method for differential expression 
analysis across replicates which propagates uncertainty from the sample-level model while modelling 
biological variance using an expression-level-dependent prior. We demonstrate the advantages of our 
method using simulated data as well as an RNA-seq dataset with technical and biological replication 
for both studied conditions. 

Availability: The implementation of the transcriptomc expression estimation and differential ex- 
pression analysis, BitScq, has been written in C++. 

Contact: glausOcs . m ain . ac . uk[ lantti . honkela@hii tTfH [MTRattr ay@shef f ie ld . ac . uk| 



1 Introduction 



High-throughput sequencing is an effective approach for transcriptome analysis. This methodology, also 
called RNA-seq, has been used to analyse unkno wn transcrip t seque nces, estimate gene expression lev- 
els and study single n ucleotide polymorphisms ( Wang et al. . 20091 ) . As shown by other researchers 
( Mortazavi et al. . 20081 ) . RNA-seq provides many advantages over microarray technology, although ef- 
fective analysis of RNA-seq data remains a challenge. 

A fundamental task in the analysis of RNA-seq data is the identification of a set of differentially ex- 
pressed genes or transcripts. Results from a differential expression (DE) analysis of indi vidual transcripts 



are e ssential in a diverse range of problems suc h as identifying differe nces between tissues (jMortazavi et al 



20081 ) . underst anding developm ental changes ( Gravelev et al. . 2011 ) and regulator such as microRNA tar- 
get prediction ( Xu et al . 2010l ). To carry out an effective DE analysis it is important to obtain accurate 
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Figure 1: Diagram showing the BitSeq analysis pipeline divided into two separate stages. In Stage 1, 
transcript expression levels are estimated using reads from individual sequencing experiments. In step 
1, reads are aligned to the transcriptome. In step 2, the probability of a read originating from a given 
transcript P(r n \I n ) is computed for each alignment based on Eq. ([1]). These probabilities are used in step 3 
of the analysis, MCMC sampling from the posterior distribution in Eq. ([3]). In Stage 2 of the analysis, the 
posterior distributions of transcript expression levels from multiple conditions and replicas are used to infer 
the probability that transcripts are differentially expressed. In step 4, a suitable normalisation for each 
experiment is estimated. The normalised expression samples are further used to infer expression-dependent 
variance hyperparameters in step 5. Using these results, replicates are summarized by estimating the per- 
condition mean expression for each transcript, Eq. @, in step 6. Finally, in step 7, samples representing 
the distribution of within-condition expression are used to estimate the probability of positive log ratio 
(PPLR) between conditions, which is used to rank transcripts based on DE belief. 



estimates of expression for each sample but it is equally im portant to properly account for all sources of vari- 
ation , technical and biolog ical, to avoid spurious DE calls (jRobinson and Smythl . 120071 : lAnders and Huber 



201(3 : lOshlack et all l2010l ). In this contribution we address both of these problems by developing inte- 



grated probabilistic models of the read generation process and the biological replication process in an 
RNA-seq experiment. 

During the RNA-seq experimental procedure, a studied specimen of transcriptome is synthesised into 
cDNA, amplified, fragmented and then sequenced by a high-throughput sequencing device. This process 
results in a dataset consisting of up to hundreds of millions of short sequences, or reads, encoding observed 
nucleotide sequences. The length of the reads depends on the sequencing platform and currently typically 
ranges from 25 to 300 base pairs. Reads have to be either assembled into transcript sequences or aligned 
to a reference genome by an aligning tool, to determine the sequence they originate from. 

With proper sample preparation, the number of reads aligning to a certain gene is approximately 



prop ortional to the abundance of fragments of transc ripts for that gene within the sample (IMortazavi et al 



20081 ) allowing researchers to study gene expression (jCloonan et all 120081 : iMarioni et aUl2008l ). However, 



during the process of transcription, most eukaryotic genes can be spliced into different transcripts which 
share parts of their sequence. As it is the transcripts of genes that are being sequenced during RNA-seq, 
it is possible to distinguish between ind ividual transcr i pts of a gene. Several methods have been proposed 



to es timate transcript expression l e vels (Li et q/.l . 120101 : 



Nicolae et all . l2010l : iKatz et all l2010l : iTirro et al 



20 111 ). Furthermore, IWang et al\ (|2010i ) showed that estimating gene expression as a sum of transcript 
expression levels yields more precise results than inferring the gene expression by summing reads over all 



exons. 



Since the transcript of origin is undecidable for reads aligning to shared subsequence, estimation of 
transcript expression levels has to be completed in a probabili stic manner. Initial studies of tra nscript 
expression used the Expectation-Maximization (EM) approach ( Li et al. . 2010l : Nicolae et al. . 2010l ). This 
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is a maximum likelihood procedure which only provides a point estimate of tran script abundance a nd does 
not measure the uncertainty in these estimates. To overcome this limitation, Katz et al. ( 20ld ) used a 
Bayesian approach to capture the posteri or distributi o n of t he transcript expression levels using a Markov 
chain Monte Carlo (MCMC) algorithm. iTurro et al] (|201ll ) have also proposed MCMC estimation for a 
model of read counts over regions that can correspond to exons or other suitable subparts of transcripts. 

In this contribution we present BitSeq (Bayesian Inference of Transcripts from Sequencing data), a 
new method for inferring transcript expression and analysing expression changes between c ondit ions. We 
use a probabilistic model of the read generation process similar to the model of iLi et all ifcoid ) and we 



develop an MCMC algorithm for Bayesian inference over the model. iKatz et all (|2010l ) developed an 
MCMC algorithm for a similar generative model but our model differs from theirs because we allow for 
multi-aligned reads mapping to different ge nes. Furthermore, we infer the overall relative expression of 
transcripts across the transcriptome whereas IKatz et al\ (201CJ) focus on relative expression of transcripts 
from the same gene. We have implemented MCMC using a collapsed Gibbs sampler to sample from the 
posterior distribution of model parameters. 

In many gene expression studies expression levels are used to select genes with differences in expression 
in two conditions, a process referred to as a DE analysis. We propose a novel method for DE analysis 
that includes a model of biological variance while also allowing for the technical uncertainty of transcript 
expression which is represented by samples from the posterior probability distribution obtained from 
the probabilistic model of read generation. By retaining the full posterior distribution, rather than a 
point estimate summary, we can propagate uncertainty from the initial read summarization stage of 
analysis into the D E analysis. Sim il ar strategies have b een shown to be effective in the DE analysis 
of microarray data ( Liu et al. . 2006 : Rattrav et al. . 20061 ) but given the inherent uncertainty of reads 
mapping to multiple transcripts we expect the approach to bring even more advantages for transcript- 
level DE analyses. Furthermore, this method accou nts for decreased te chnical reproducibility of RNA-seq 
for low-expressed transcripts recently reported by lLabai et all (|201ll ) and can decrease the number of 
transcripts falsely identified as differentially expressed. 



2 Methods 

The BitSeq analysis pipeline consists of two main stages: transcript expression estimation and differential 
expression assessment, see Figure [TJ For the transcript expression estimation the input data are single-end 
or pair-end reads from a single sequencing run. The method produces samples from the inferred probability 
distribution over transcripts' expression levels. This distribution can be summarized by the sample mean 
in case one is only interested in expression. 

The DE analysis uses posterior samples of expression levels from two or more conditions and all 
available replicates. The conditions are summarized by inferring the posterior distribution of condition 
mean expression. Samples from the posterior distributions are compared to score the transcripts based on 
the belief in change of expression level between conditions. 



2.1 Stage 1: Transcript expression estimation 

The initial interest when dealing with RNA-seq data is estimation of expression levels within a sample. In 
this work, we focus on the transcript expression levels, mainly represented by = (9i, . . . , 9m), the relative 
abundance of transcripts' fragments within the studied sample, where M is the total number of transcripts. 
This can be further transformed into relative expression of transcripts 6^} = 0m/(lrn{^2iLi@i/h)), where 
l m is the length of the m-th transcript. Alternatively, expression c an be represented by re ads per kilobase 
per million mapped reads, RPKM m = 6 m x 10 9 // m , introduced by Mortazavi et al. ( 20081 ). 



We use a generative model of the data, depicted in Figure El which models the RNA-seq data as 
independent observations of individual reads r n G R = {n, . . . , rjv}, depending on the relative abundance 
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Figure 2: Graphical representation of the RNA-seq data probabilistic model. We can consider the 
observation of reads R = (r\, . . . , rjy) as N conditionally independent events, with each observation of a 
read r n depending on the transcript (or isoform) it originated from I n . The probability of sequencing a 
given transcript I n depends on the relative expression of fragments 6 and the noise indicator Z% ct . The 
noise indicator variable Z^ ct depends on noise parameter 6 act , and indicates that the transcript being 
sequenced is regarded as noise, which enables observation of low quality and un-mappable reads. 

of transcripts' fragments and a noise parameter 9 act . The parameter 9 act determines the number of reads 
regarded as noise and enables the model to account for unmapped reads as well as for low-quality reads 
within a sample. 

Based on the parameter 9 act , indicator variable Z^ ct ~ Bern(# arf ) determines whether read r n is 
considered as noise or a valid sequence. For a valid sequence, the process of sequencing is being modelled. 
Under the assumption of reads being uniformly sequenced from the molecule fragments, each read is 
assigned to a transcript of origin by the indicator variable I n , which is given by categorical distribution 
I n ~ Cat(0). 

For a transcript m we can express the probability of an observed alignment as the probability of choosing 
a specific position p and sequencing a sequence of given length with all its mismatches, P{r n \I n = m) = 
P(p\m)P(r n \seq mp ). For paired-end reads we compute the joint probability of the alignment of a whole 
pair, in which case we also have to consider fragment length distribution P(l), 

P(r n 1 \r n 2 \l n = m) = 

P(p\l,m)P(l\m)P(r^\seq mlpi )P(r^\seq mlp2 ) . (1) 

Details of alignment probability computation including optional position and sequence-specific bias cor- 
rection methods are presented in Supplementary Material. For every aligned read, we also calculate the 
probability that the read is from neither of the aligned transcripts, but is regarded as sequencing error or 
noise P(r n \ noise). This value is calculated by taking the probability of the least probable valid alignment 
corrupted with two extra base mismatches. 

The joint probability distribution of the model can now be written as 



P(R, I, Z act ,0, 9 act ) = P{0)P(6 act ) 

X lln=l (P(r n \I n )P(I n \0,Z^)P(Z^\e^)) 



where we use weak conjugate Dirichlet and Beta prior distributions for 6 and 9 act , respectively. The 
posterior distribution of the model's parameters given the data R can be simplified by integrating over all 
possible values of Z act : 

p(i,o,e act \R) <x P(e)P(e^)U n .j^ {P(r n \i n )c & t(i n \e)o act ) 

xnn;/ n =o(^(^|noise)(l-^)) . W 

According to the model any read can be a result of sequencing either strand of an arbitrary transcript 
at a random position. However, the probability of a read originating from a location where it does not align 
is negligible. Thus the term P(r n \I n )Ca,t(I n \0)9 act has to be evaluated only for transcripts and positions 
to which the read does align . To accomplish t his we first align the reads to the transcript sequences using 
the Bowtie alignment tool ( Langmead et all , 20091 ) . preserving possible multiple alignments to different 



transcripts. We then pre-compute P(r n \I n ) only for the valid alignments. (See steps 1-2 in Figure [TJ) 
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The closed form of the posterior distribution is not analytically tractable and an approximation has to 
be used. We can analytically marginalise 6 and apply a collapsed Gibbs sampler to produce samples from 
the posterior probability distribution over I n ( Geman and Geman . 1993 : Griffiths and Stevvers . 20041 ) . 



These are used to compute a posterior for 6, which is the main variable of interest. Full update equations 
for the sampler are given in Supplementary Material. 

In the MCMC approach, multiple chains are sampled a t the same time and convergence is monitored 
using the R statistic as described by Gelman et al. ( 20031 ). The R statistic is an estimate of a possible 



scale reduction of the marginal posterior variance and provides a measure of usefulness of producing more 
samples. Posterior samples of 6 provide an assessment of the abundance of individual transcripts. As 
well as providing an accurate point estimate of the expression levels through the mean of the posterior, 
the probability distribution provides a measure of confidence for the results, which can be used in further 
analyses. 

2.2 Stage 2: Combining data from multiple replicates and estimating differential 
expression 

To identify transcripts that are truly differentially expressed it is necessary to account for biological 
variation by using replication for each experimental condition. Our method summarizes these replicates 
by estimating the biological variance and inferring per-condition mean expression levels for each transcript. 
During the differential expression analysis we consider the logarithm of transcript expression levels y m = 
log# m . The model for data originating from multiple replicates is illustrated in Figure O We use a 
hierarchical log-normal model of within-condition expression. The prior over the biological variance is 
dependent on the mean expression level across conditions and the prior parameters (hyper-parameters) 
are learned from all of the data by fitting a non-parametric regression model. We fit a model for each gene 
using the expression estimates from Stage 1. 

A novel aspect of our Stage 2 approach is that we fit models to posterior samples obtained from the 
MCMC simulation from Stage 1, which can be considered "pseudo-data" representing expression corrupted 
by technical noise. A pseudo-data vector in constructed using a single MCMC sample for each replicate 
across all conditions. The posterior distribution over per-condition means is inferred for each pseudo-data 
vector using the model in Figure [3] (described below). We then use Bayesian model-averaging to combine 
the evidence from each pseudo-data vector and determine the probability of differential expression. This 
approach allows us to account for the intrinsic technical variance in the data; it is also computationally 
tractable because the model for a single pseudo-data vector is conjugate and therefore inference can 
be carried out exactly. This effectively regularizes our variance estimate in the case that the number of 
replicates is low. As shown in Section [33] this provides improved control of error rates for weakly expressed 
transcripts where the technical variance is large. 

For a condition c we assume R c replicate datasets. The log-expression from replicate r, is assumed 

(c) 

to be distributed according to a normal distribution with condition mean expression Hm , normalised by 
replication specific constant rS cr ^ , and precision Xm , yffl ~ Norm^ W cr) ,l/A^ } ). As our parameters 
represent the relative expression levels in the sample, BitSeq implicitly incorporates normalisation by the 
total number of reads or the RPKM measure, as was done when generating the results in this publication. 
Further normalisation can be implemented using the normalisation constant n^ cr \ which is constant for 
all transcripts of a given r eplicate and can be estimated prior to probabilistic modeling using, for example, 
a quantile based method ( Robinson and Qshlack . 20 id ) or any other suitable technique. 



The condition mean expression is normally distributed fjjn ~ NormOu£?,l/(A$A )) with mean /j,m , 

(c) 

which is empirically calculated from multiple samples, and scaled precision 

X&'Ao- The prior distribution 

(c) 

over per-transcript, condition specific precision Am is a Gamma distribution with hyperparameters ag, /3q, 
which are fixed for a group of transcripts with similar expression level, G. 
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Figure 3: Graphical model of the biological variance in transcript expression experiment. For replicate 
r, condition c and transcript m, the observed log-expression level y£P is normally distributed around 
the normalised condition mean expression /if + with biological variance 1/Xm . The condition 
mean expression fim for each condition is normally distributed with overall mean expression /J$ and 
scaled variance 1/(A^ Ao)- The inverse variance, or precision Xm , for a given transcript m follows a 
Gamma distribution with expression-dependent hyperparameters olcPq, which are constant for a group 
of transcripts G with similar expression. 



The hyperparameters ckcPg determine the distribution over per-transcript precision parameter X m 
which varies with the expression level of a transcript (see Supplementary Figure 3 of the supplementary 
material). For this reason, we inferred these hyperparameters from the dataset for various levels of expres- 
sion, prior to the estimation of precision A m and mean expression /x m . We used the same model as Figure 
[3] applied jointly to multiple transcripts with similar empirical mean expression levels fJjti ■ We set a uni- 
form prior for the hyperparameters, marginalized out condition means and precision, an d used a MCMC 
algorithm to sample ag, /3q. The samples of ag, [}q were smoothed by Lowess regression (IClevelandl . [l98lh 
against empirical mean expression to produce a single pair of hyperparameters for each group of transcripts 
with similar expression level. 

This model is conjugate and thus leads to a closed form posterior distribution. This allows us to directly 
sample A m and \i m given each pseudo-data vector y m constructed from the Stage 1 MCMC samples: 



P{VmAm\ym) = I]L Ganlnla (^m K> V&c) 



Norm ( 



(4) 



b c = /3 G + U(^) 2 X + 



Xq+Rc 



Samples of /i™' and [i^f 1 are used to compute the probability of expression level of transcript m 
in condition c\ being greater than the expression level in condition C2- This is done by counting the 
fraction of samples in which the mean expression from the first condition is greater, that is P(^m^ > 

l^rn^\R) = jf J2n=i ^{^m]n > ^rn,n) which we refer to as the Probability of Positive Log-Ratio (PPLR). 
Here, n = 1 . . . N represents one sample from the above posterior distribution for each of N independent 
pseudo-data vectors. Subsequently, ordering transcripts based on PPLR produces a ranking of most 
probable up-regulated and down-regulated tra nscripts. This kin d of one-sided Bayesian test has previously 
been used for the analysis of microarray data (jLiu et allEoOfih . 
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Figure 4: In plots (a) and (b) we show the posterior transcript expression density for pairs of transcripts 
from the same gene. This is a density map constructed using the MCMC expression samples for these three 
transcripts. In (c) we show the marginal posterior distribution of expression levels of the same transcripts 
as illustrated by hi stograms of MCMC samples. The sequencing data is from miRNA-155 study published 
bv lXu et al\ fcoid ). 



3 Results and Discussion 

3.1 Datasets 

We carried out experiments evaluating both gene expression estimation accuracy as well as differential 
expression analysis precision. For the evaluation of bias correction effects as well as comparison with other 
methods (Table [D we used paired-end RNA-seq data from the Microarray Quality Control (MAQC) project 
(jShi et all . 120061 ) (Short Read Archive accession number SRA012427), because it contains 907 transcripts 
which were also analysed by TaqMan qRT-PCR. The results from qRT-PCR probes ar e generally re 



garde d as ground truth expression estimates for comparison of RNA-seq analysis methods ([Roberts et al 



201 ll ). We used RefSeq refGene transcriptome annotation, assembly NCBI36/h gl8 in order to keep r esults 
consistent with qRT-PCR data as well as previously published comparisons bv iRoberts et all <|201lh . 

The second dataset used in our evaluation was originally published by Xu et al. ( 20ld ) in a study 
focused on identification of microRNA targets and provides technical as well as biological replicates for 
both studied conditions. We use this data to illustrate the importance of biological replicates for DE 
analysis (Figure El Supplementary Figure 3 for biological variance) and the advantages of using a Bayesian 
approach for both expression inference and DE analysis (Figure H]). 

For the purpose of evaluating and comparing BitSeq to existing differential expression analysis methods, 
we created artificial RNA-seq datasets with known expression levels and differentially expressed transcripts. 
We selected all transcripts of chromosome 1 from human genome assembly NCBI37/hgl9 and simulated 
two biological replicates for each of the two conditions. We initially sample the expression for all replicates 
using the same mean relative expression and variation between replicates as were observed in the Xu et al. 
data estimates. Afterwards we randomly choose one third of the transcripts and shift one of the conditions 
up or down by a known fold change. Given the adjusted expression levels, we generated 300k single-end 
reads uniformly distributed along the transcripts. The reads were reported in Fastq format with Phred 
scores randomly generated according to empirical distribution learned from the SRA012427 dataset. With 
the error probability given by a Phred score, we generated base mismatches along the reads. 



3.2 Expression level inference 

Figure H] demonstrates the ambiguity that may be present in the process of expression estimation. In 
Figures IHa) and [Hb) we show the density of samples from the posterior distribution of expression levels 
for two pairs of transcripts. The expression levels of transcripts ucOlOoho.l and ucOlOohp.l (Fig. HJa)) 
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read model 


BitSeq 


Cufflinks 


RSEM 


MMSEQ 


uniform 
non-uniform 


0.7677 

0.8011 


0.7503 
0.8056 


0.7632 
0.7633 


0.7614 



Table 1: Comparison of expression estimation accuracy against TaqMan qRT-PCR data and the ef- 
fect of non-uniform read distribution models using correlation coefficient R 2 of average expression from 
three technical replicates with the 893 matching transcripts analysed by q RT-PCR. The seq uencing data 
(SRA012427) is part of the MAQC project and was originally published by IShi et all (j2006T ). 



are negatively correlated. On the other hand transcripts ucOlOoho.l and uc001bwm.3 exhibit no visible 
correlation (Fig. H^b)) in their expression level estimates. Even though this kind of correlation does 
not have to imply biological significance, it does point to technical difficulties in the estimation process. 
These transcripts share a significant amount of sequence and the consequent read mapping ambiguity 
leads to greater uncertainty in expression estimates (See Supplementary Figure 1(d) for transcript profile). 
Bayesian inference can be used to assess the uncertainty due to such confounding factors, unlike the 
maximum likelihood point estimates provided by an EM algorithm. The marginal posterior probability 
of transcript expression for each transcript is shown in Figure H[c). In our analysis pipeline, the marginal 
posterior distributions are propagated into the differential expression estimation stage, thus the uncertainty 
from expression estimation is taken into account when assessing whether there is strong evidence that 
transcripts are differentially expressed. 



3.3 Expression estimation accuracy and read distribution bias correction 

Initially, it was assumed that high-throughput sequencing produces reads uniformly distributed along 



(Dohm et al. 2008 


Wu et al. 


2011; 


Roberts et al.. 



for transcript expression inference (Figure [2|) includes a model of the underlying read distribution which 
in the P(r n \I n = m) term that is calculated as a pre-processing step. The current BitSeq i mplementation 



contai ns the option of using a uniform read density model or using the model proposed bv lRoberts et al. 

(mil) 

which can account for positional and sequence bias. The effect of correcting for read distribution was 
analysed using the SRA012427 dataset and results are presented in Table [T] We also compare B itSeq with 



three ot her transcript exp ression estimation m ethods: Cufflinks vO .9.3 ([Roberts et all l201ll ). MMSEQ 



vO.9.18 (Turro et al 



201lh and RSEM vl.1.14 (|Li and Dewevi . 1201 lh . 
The dataset contains three technical replicates. These were analysed separately and the resulting 
estimates for each method were averaged together. Subsequently, we calculated the squared Pearson 
correlation coefficient (R 2 ) of the average expression estimate and the results of qRT-PCR analysis. All 
four methods used with the default uniform read distribution model provide similar level of accuracy with 
BitSeq performing slightly better than the other three methods. 

Both BitSeq and Cufflinks use the same method for read distribution bias c orrection and provid e 
improvement over the uniform model similar to improvements previously reported bv lRoberts et all (|201ll ). 
We used version 0.9.3 of Cufflinks (as used by Roberts et al.) since we found that the most recent stable 
version of Cufflinks (version 1.3.0) leads to much worse performance for both uniform and bias-corrected 
models (see Supplementary results Section 2.2). The RSEM package uses its own method for bias correction 
based on the relative position of fragments, which in this case did not improve the expression estimation 
accuracy for the se lected transcripts. We were not able to compare the bias corrected results of MMSEQ 
( Turro et all l201ll ) due to an error in an external R package mseq used for the bias correction. However, 
the bias correction of mseq package itself was already compared against C ufflinks on the same dataset 



showing slightly worse accuracy and less improvement ( Roberts et al. . 201 ll ). 
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Figure 5: Comparison of BitSeq to naive approach for combining replicates within a condition for tran- 
script uc001avk.2 of the Xu et al. dataset. (a) Initial posterior distributions of transcript expression levels 
for two conditions (labeled CO, CI), with two biological replicates each (labeled R0, Rl). (b) Mean expres- 
sion level for each condition using the naive approach for combining replicates. The posterior distributions 
from replicates are joined into one dataset for each condition, (c) Inferred posterior distribution of mean 
expression level for each condition using the probabilistic model in Figure [3l (d) Distribution of differences 
between conditions from both approaches show that the naive approach leads to overconfident conclusion. 



In case of BitSeq, the major improvement of accuracy originates from using the effective length normal- 
ization. To compare the results with qRT-PCR, the relative expression of fragments 6 has to be converted 
into either relative expression of transcripts (#*) or RPKM units. Using the bias corrected effective length 
for this conversion leads to the higher correlation with qRT-PCR (Supplementary Table 1). This means 
that using an expression measure adjusted by the effective length, such as RPKM, is more suitable than 
normalized read counts for DE analysis. 

For more results comparing the transcript expression estimation accuracy and within gene relative 
expression accuracy, please refer to supplementary material Section 2.3. 



3.4 Differential expression analysis 

We use the Xu et al. dataset to demonstrate the DE analysis process of BitSeq. This dataset contains 
technical and biological replication for both studied conditions. We observed significant difference between 
biological and technical variance of expression estimates (Supplementary Figure 3). Furthermore, the 
prominence of biological variance increases with transcript expression level. We illustrate how BitSeq 
handles biological replicates to account for this variance in Figure by showing the modelling process for 
one example transcript given only two biological replicates for each of two conditions. 



Figure 5(a) shows histograms of expression level samples produced in the first stage of our pipeline. 
BitSeq probabilistically infers condition mean expression levels using all replicates. For comparison, we 
used a naive way of combining two replicates by combining the posterior distributions of expression into a 



single distribution. The resulting posterior distributions for both approaches are depicted in Figures 5(b) 



and 5(c) 



The probability of differential expression for each transcript is assessed by computing the difference 
in posterior expression distributions of the two conditions. Resulting distributions of differences for both 
approaches are portrayed in Figure |5(d)| with obvious difference in the level of confidence. The naive 
approach reports high confidence of up-regulation in the second condition, with the probability of positive 
log ratio (PPLR) being 0.995. When biological variance is being considered by inferring the condition 
mean expression, the significance of differential expression is decreased to PPLR 0.836. 
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Figure 6: ROC evaluation of transcript level DE analysis using artificial dataset, comparing BitSeq with 
alternative approaches. The curves are averaged over 5 runs with different set of transcripts being differ- 
entially expressed by fold change uniformly distributed in the interval (1.5, 3.5). We discarded transcripts 
without any reads initially generated as these provide no signal. Panel (a) shows global average behaviour 
while in (b), (c) and (d) transcripts were divided into 3 equally sized groups based on the logged mean 
generative read count: [0,1.061), [1.061,2.940), [2.940, oo), respectively. 



3.5 Assessing DE performance with simulated data 



Using artificially simulated data with a predefined set of differentially expressed transcripts, we evaluated 
our app roach and compared it wit h four oth er methods commonly used for di fferential expression analysis . 



DESeq ([Anders and Huberl . l20ld ). edgeR dRobinson et al\. l20ich. bavSe q (lHardcastle and Kellvi . l2O10h 



were designed to operate on the gene level and Cuffdiff (jTrapnell et all [2010j) on the transcript level. 
Despite not being designed for this purpose, we consider the first three in this comparison as the use case 
is very similar and there are no other well known alternatives besides Cuffdiff that would use replicates for 
transcript level DE analysis. All other methods beside Cuffdiff use BitSeq Stage 1 transcript expression 
estimates converted to counts. Details regarding use of these methods are provided in the Supplementary 
material, Section 2.5. Figure [6] shows the overall results as well as split into three parts based on the 
expression of the transcripts. The ROC curves were generated by averaging over 5 runs with different 
transcripts being differentially expressed and the figures are focused on the most significant DE calls with 
false positive rate below 0.2. 



Overall (Figure 6(a)), BitSeq is the most accurate method, followed first by baySeq, then edgeR and 
DESeq with Cuffdiff further behind. This trend is especially clear for lower expression levels (Figure [6(b)[ 
6(c) ). The overall performance here is fairly low because of high level of biological variance. For highest 
expressed transcripts (Figure [6(d)[ ), DESeq and edgeR show slightly higher true positive rate than BitSeq 
and baySeq, especially at larger false positive rates. Further details and more results from the DE analysis 
comparison can be found in the supplementary material Section 2.5. 



4 Conclusion 

We have presented methods for transcript expression level analysis and differential expression analysis 
that aim to model the uncertainty present in RNA-seq datasets. We used a Bayesian approach to provide 
a probabilistic model of transcriptome sequencing and to sample from the posterior distribution of the 
transcript expression levels. The model incorporates read and alignment quality, adjusts for non-uniform 
read distributions and accounts for experiment-specific fragment length distribution in case of paired-end 
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reads. The accuracy of inferred expression is comparable and in some cases outperforms other compet- 
ing methods. Nevertheless, the major benefit of using BitSeq for transcript expression inference is the 
availability of full posterior distributions useful for further analysis. 

The inferred distributions of transcript expression levels can be further analysed by the second stage 
of BitSeq for DE analysis. Given biological replicates, BitSeq accounts for the intrinsic noise and variation 
and produces more reliable estimates of expression levels within each condition, thus providing fewer false 
differential expression calls. We want to highlight that in order to make most accurate differential expres- 
sion assessment, experimental design must include biological replication. BitSeq is capable of combining 
information from multiple biological and technical replicas and comparing multiple conditions. Further 
studies including multiple replicates are necessary to investigate the effects of library preparation and 
biological variance. 

Acknowledgement 

Funding: This work was supported under the European ERASysBio-l- initiative project SYNERGY by 
the Biotechnology and Biological Sciences Research Council [BB/I004769/2 to M.R.] and the Academy of 
Finland [135311 to A.H.]; by the Academy of Finland [121179 to A.H.]; and the 1ST Programme of the 
European Community, under the PASCAL2 Network of Excellence [IST-2007-216886]. This publication 
only reflects the authors' views. 

References 

Anders, S. and Hubcr, W. (2010). Differential expression analysis for sequence count data. Genome Biol, 11(10), 
R106. 

Cleveland, W. S. (1981). LOWESS: A Program for Smoothing Scatterplots by Robust Locally Weighted Regression. 
Am Stat, 35(1). 

Cloonan, N. et al. (2008). Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods, 5(7), 
613-9. 

Dohm, J. C, Lottaz, C, Borodina, T., and Himmelbauer, H. (2008). Substantial biases in ultra-short read data sets 
from high-throughput DNA sequencing. Nucleic Acids Res, 36(16), el05. 

Gclman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003). Bayesian Data Analysis. Chapman and Hall/CRC, 
2 edition. 

Geman, S. and Geman, D. (1993). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of image. 
J Appl Stat, 20, 25-62. 

Graveley, B. R. et al. (2011). The developmental transcriptome of Drosophila melanogastcr. Nature, 471(7339), 
473-479. 

Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proc Natl Acad Sci USA, 101 Suppl, 5228-35. 

Hardcastle, T. J. and Kelly, K. A. (2010). baySeq: Empirical Bayesian Methods For Identifying Differential Expres- 
sion In Sequence Count Data. BMC Bioinformatics , 11(1), 422. 

Katz, Y., Wang, E. T., Airoldi, E. M., and Burge, C. B. (2010). Analysis and design of RNA sequencing experiments 
for identifying isoform regulation. Nat Methods, 7, 1009-1015. 

Labaj, P. P., Leparc, G. G, Linggi, B. E., Markillie, L. M., Wiley, H. S., and Kreil, D. P. (2011). Characterization 
and improvement of RNA-Scq precision in quantitative transcript expression profiling. Bioinformatics , 27(13), 
i383-i391. 



11 



Langmead, B., Trapnell, C, Pop, M., and Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short 
DNA sequences to the human genome. Genome Biol, 10(3), R25. 

Li, B. and Dewey, C. N. (2011). RSEM: accurate transcript quantification from RNA-Seq data with or without a 
reference genome. BMC Bioinformatics , 12, 323. 

Li, B., Ruotti, V., Stewart, R. M., Thomson, J. A., and Dewey, C. N. (2010). RNA-Seq gene expression estimation 
with read mapping uncertainty. Bioinformatics , 26(4), 493-500. 

Liu, X., Milo, M., Lawrence, N. D., and Rattray, M. (2006). Probe- level measurement error improves accuracy in 
detecting differential gene expression. Bioinformatics, 22(17), 2107-13. 

Marioni, J. C, Mason, C. E., Mane, S. M., Stephens, M., and Gilad, Y. (2008). RNA-seq: an assessment of technical 
reproducibility and comparison with gene expression arrays. Genome Res, 18(9), 1509-17. 

Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and quantifying mammalian 
transcriptomes by RNA-Seq. Nat Methods, 5(7), 621-8. 

Nicolae, M., Mangul, S., M, I., and Zelikovsky, A. (2010). Estimation of alternative splicing isoform frequencies from 
RNA-Seq data. In Proc. 10th Intl Conf. on Algorithms in Bioinformatics , pages 202-214. Springer- Verlag. 

Oshlack, A., Robinson, M. D., and Young, M. D. (2010). From RNA-seq reads to differential expression results. 
Genome Biol, 11(12), 220. 

Rattray, M., Liu, X., Sanguinetti, G., Milo, M., and Lawrence, N. D. (2006). Propagating uncertainty in microarray 
data analysis. Brief Bioinform, 7(1), 37-47. 

Roberts, A., Trapnell, C., Donaghcy, J., Rinn, J. L., and Pachter, L. (2011). Improving RNA-Seq expression 
estimates by correcting for fragment bias. Genome Biol, 12(3), R22. 

Robinson, M. D. and Oshlack, A. (2010). A scaling normalization method for differential expression analysis of 
RNA-seq data. Genome Biol, 11(3), R25. 

Robinson, M. D. and Smyth, G. K. (2007). Moderated statistical tests for assessing differences in tag abundance. 
Bioinformatics, 23(21), 2881-2887. 

Robinson, M. D., McCarthy, D. J., and Smyth, G. K. (2010). edgeR: a Bioconductor package for differential 
expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-40. 

Shi, L. et al. (2006). The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility 
of gene expression measurements. Nat Biotechnol, 24(9), 1151-61. 

Trapnell, C. et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and 
isoform switching during cell differentiation. Nat Biotechnol, 28(5), 516-520. 

Turro, E. et al. (2011). Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. 
Genome Biol, 12(2), R13. 

Wang, X., Wu, Z., and Zhang, X. (2010). Isoform abundance inference provides a more accurate estimation of gene 
expression levels in RNA-seq. J Bioinform Comput Biol, 8, 177-192. 

Wang, Z., Gerstein, M., and Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 
10(1), 57-63. 

Wu, Z., Wang, X., and Zhang, X. (2011). Using non-uniform read distribution models to improve isoform expression 
inference in RNA-Seq. Bioinformatics, 27(4), 502-508. 

Xu, G. et al. (2010). Transcriptome and targetome analysis in MIR155 expressing cells using RNA-seq. RNA, pages 
1610-1622. 



12 



Supplementary Information 



A Methods 

A.l Alignment probabilities 

We present the alignment probability computation for the case of paired end reads. For single reads, the 
terms related to fragment or insert length distribution and the other paired read disappear. 

For a given transcript /„ = m; m € {1, . . . , M}, the probability of observing a pair of reads (rjp ,r^) 
is determined by the probability of the read being sequenced from a specific strand s at a specific position 
p with a specific insert length I and the probability of reporting the reads after sequencing the sequences 

( Se Qmlps> Se( lmlps)> 

P(r£\r^\I n = m) = P(l\m)P(p\l,m)P(s\m)P(4P\seq mlps )P(r^\seq mlps ) • (5) 

Unless a strand specific sequencing protocol is used, the probability of observing a read from either strand 
is the same, P(s\m) = 1/2, and can be ignored. The fragment length distribution P(l\m) is assumed to be 
log-normal with its parameters given by the user or estimated from read pairs with only a single transcript 
alignment. 

The probability of sequencing a given position is in general 

P(p\ I n =m,l)= , Vff . (6) 

Yl;=i b m ( P ) 

where b m (p) denotes bias for a particular position p on transcript m. For a constant b m (p) corresponding 
to a uniform read distribution, this reduces to P(p\m) = l/(l m — l r + 1) which only depends on the lengths 
of the transcript l m and the read l r . 

We calculate the probability of observing a sequence based on the read's quality base scores and 
mismatches. 

The Phred score can be converted into probability of base-calling error p err ^. The final sequence 
probability is now obtained as 

P(r^\seq mps ) = (l-jPerr,i) Pevv,i, (7) 

iGmatches id mismatches 

where the probability of error for a given base i is based on the Phred score p eTT} i = io _Phredi / 10 . 
A. 1.1 Bias estimation 

Our model can easily incorpora te a correction for p osition and sequence specific biases. One example 
of such a model is presented by iRoberts et al. (jmil) 

for correcting the fragmentation bias. Under this 



model, we have 

UP) = 6^(e5)6^ 3 (e3)^ 5 (e 5 )^ 3 (e 3 ), (8) 

where b S m{e§) and b S m{e^,) are the sequence specific biases for 5' and 3' ends of the fragment, respectively, 
and ^ 5 (es) and 6m 3 (es) are the corresponding positional biases. 

We use separate variable length markov model s to capture the bias for each end. The structure of this 
model is the same as that of IRoberts et al\ (j201lh . presented in Figure 2 of the supplementary methods. 
For the sequence bias these are 

^) = n tII- ( 9 ) 
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which are based on 21 probabilities i/j^ Vn from 8 bases before and 12 bases after the read starting position. 
Here ip 5,R refers to the biased and (j) 5,u to a uniform model, n is a node or a position, 7r n are the parents 
of node n and ijj^ n is the probability of base X at node (or position) n given the bases observed on 
parent nodes ir n . The model has 744 parameters in all, with each node having 0, 1 or 2 parents as in 
the model of Roberts et al. ( 201ll ). The parameters are estimated from empirical frequencies using reads 
with a single alignment. For a read r aligning to transcript m we increase appropriate probabilities ip 5,R 
by l/9 m , where 9 m is an initial coarse expression estimate obtained by running BitSeq with uniform read 
distribution model beforehand. In the contrasting uniform model for all K = l m — l r + 1 possible positions 
of read of length l r , the appropriate probabilities ip 5 ' U are increased by -p~K- ^ ne m °del b S m (e^) is similar. 

In addition to the sequence-specific bias, there is a model for positional bias within the transcript. 
This is 

^ 5 (e 5 ) = (10) 

where uj^ p is the probability for starting position within transcript of length I on position p. The proba- 
bilities are modelled within 5 transcript length bins and 20 bins of relative position. The probabilities are 
again estimated from empirical frequencies of reads with single alignments taking into account expression 



A. 2 Effective length computation 

For the purpose of reporting normalized measure such as RPKM, 9, the relative expression of fragments, 
has to be normalized by the amount of reads or fragments that can be produced by a unit of transcript. 
When assuming uniform read distribution of single-end reads, this would be l m — l r as the number of 
starting positions for a read of length l r . For pair-end reads, the effective length of a transcript has to 
account for fragment length distribution as well, 

lWf) = J2p(lf\™)*(lm-lf). (11) 

'/=1 

With the use of read distribution with bias correction, we learn more about the distribution of fragments 
and thus can use this information when computing the effective length. In this case, the effective length 
takes into account bias weight for every position of the transcript, 

lm l m—l f 

lW+Mas) = J2p(l f \ m ) £ b m (p) (12) 

lf=l p=l 

As we show later in Section IB.2I of this Supplementary material, using the bias corrected effective length 
can substantially improve the accuracy of our method. 



A. 3 Gibbs sampling in expression estimation (Stage 1) 

We apply a collapsed Gibbs sampler for Stage 1 estimation by marginalising out the expression level and 
noise level parameters 6 and 9 act and iteratively resampling the isoform assignments I n of each read given 
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the assignments of other reads /( n ) . The full update rules for the sampler are 

P{I n \I^ n \R) = C & t{I n \4>* n ), (13) 
Co = P(r„|noise)(/3 arf + C { ~ n) )/Z^\ 

m^O; <t>*nm = P(rn\In)(* aCt + 4" »> ) ] , 

<4" n) =E*n*(*=™). 
4" n) =E^n^>0), 

with being a constant normalising 0^ to sum up to 1, and a dir = 1, a act = 2, f3 act = 2. 

As an alternative, it is also possible to use a regular Gibbs sampler alternating between sampling I n 
and 0. The corresponding update rules are 

p(i n \e,e act ,R) = c a ,t(i n \<p n ), (u) 

<p n0 = P(r n \no^){l-9 act )/Z^\ 
m ± 0; nm = P(r n |/ n )fl m # ac 7zW, 
P(0|I, fl) = Biv(6\(a dir + Ci, . . . , a dir + C A/ )), (15) 
P(6 act \I,0,R) = Beta(6 act \a act + N - C , P act + C ), (16) 



This approach is usually less efficient in practice, though. 
A. 4 Differential Expression model (stage 2) 

The Differential Expression (DE) model is shown in Figure 3 of the main paper. We consider data from 
conditions c = 1 . . . C with number of replicates for each condition denoted R± , . . . , Rc • We fit the model 
to each transcript m independently using "pseudo-data" ym = logfl^T which is created from MCMC 
samples from Stage 1. One sample of 6^ is drawn for each (r, c) combination to create a pseudo-data 
vector y m of length Ylc=i ^c- Inference is carried out independently for each pseudo-data vector and the 
results are then combined as described in the main text. This allows the technical error from Stage 1 to 
be propagated through the model. Since the model is conjugate then the inference for each pseudo-data 
vector is exactly tractable and no further MCMC is required to sample the condition means. 



A. 4.1 Parameter estimation for each transcript 

The condition means are denoted fi m = (fJm , . . . , fim ) and we are interested in inferring the posterior 
distribution over the means given one pseudo-data vector y m . The model is defined as, 

yW~Norm(/x&>,l/A#) 
^)~Norm(^,l/(A AW)) 
A$ ~ Gamma(«G, (3q) 

with hyper-parameters \o,olg,Pg which are estimated from groups of transcripts with similar mean ex- 
pression across conditions. The hyper-parameter /im is fixed at the empirical mean transcript expression 
across conditions. 
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\y m ) <xp(y 

III } 

C R c 



c=l 
C* 



r=l 



OC 



J Gamma(A^ |o c , 6 c )Norm /i c 



c=l 



a c = a G + — 



7 fl .if. (0)2 ^ Q 2 (Ao/^+ffj/c) 2 ^ 

^ = A?+ 2^ +5?/C Ao + ^ c J 



where 5yc denotes 2/"^ an, l £ 2 2/ c denotes X^=i 



.^2 



A. 4. 2 Hyper-parameter estimation across transcript groups 

For hyper-parameter estimation we consider a set of transcripts m = 1 . . . M' in a group g of transcripts 
with similar expression. The hyperparameter fj,Q is now set to the mean expression per group of transcripts 
and Ao is set to 2.0. We have pseudo-data samples for each transcript and we are interested in 
hyperparameters a and /3, where (5 is the rate of Gamma distribution. The model is defined as, 

y& ~Norm(/$,l/A#) 
^~Normrf,l/(AMA )) 
a$ ~Gamma(a, ft) 
P(a, f3) ~Uniform(0, oo) 

The hyper-parameter posterior distribution is given by, 



P(a,P\y) <xP(a,P)P(y\a,P) 

M' C 
m=l c=l 

M' C p p R c 

« II II / d\tip(\ti\a,(3) / d$P($\\$)]lP(yW\\k\iJ$: 

m=l c=l ^ ^ r=l 

/3° r(a + R, " 

oc 1 



1= lc=l r (°) fa.lf x „(0) 2 , Q2 _ (Ap^'+^c)^ 



This distribution is not in a standard form and we use Metropolis-Hastings Random walk MCMC to 
sample a and /3. We then use lowess smoothing across groups to estimate the mean hyper-parameter for 
each transcript according to its empirical mean expression level across conditions. 
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(d) Transcript sequence profile. 

Figure 7: In plots (a), (b) and (c) we show the posterior transcript expression density for pairs of transcripts 
from the same gene. This is a density map constructed using the MCMC expression samples for these 
three transcripts. In (d) we show transcript sequence profil e obtained from the UCSC genome browser. 
The sequencing data is from miRNA-155 study published by Xu et al. ( 2010l ). 



B Results 

B.l Transcript expression inference 

In the main text (Figure 4) we illustrate the correlation present in the expression posterior distribution 
for transcripts that share a large proportion of transcribed sequence. Here we provide all three pairwise 
plots for the transcripts ucOlOoho.l, ucOlOohp.l and uc001bwm.3, which are the only transcripts of gene 
Q6ZMZ0 in the UCSC known Gene annotation. The expression samples of uc001bwm.3 and ucOlOohp.l 



(Figure 7(c) ) are also negatively correlated. This means that the model is not able to decide from which 
transcript some of the reads originated and the posterior distribution captures all viable assignments. The 
transcript sequence profile in Figure |7(d)| clearly demonstrates the similarity of the transcripts that causes 
higher uncertainty when inferring the transcript expression levels. 



B.2 Read distribution bias correction 

We compared four different methods for expression estimation which include bias correction options for 
non-uniform read distribution. The extended results are presented in Table [2j where we report the R 2 
correlation of 893 transcript expression estimates with the TaqMan qRT-PCR results. We used every 
method to analyse each of the three technical replicates separately and then used the average expression 
level for the comparison. As was already stated in the main paper, the newest stable version of Cufflinks 
does provide the lowest correlation. We resorted to using the version 0. 9 .3 wh ich was used in the paper 
presenting the bias correction method adopted by BitSeq ( Roberts et all lio 111 ). 

For BitSeq, the major benefit of the bias correction algorithm comes from the effective transcript length 
normalisation. Relative expression of fragments used by BitSeq can be converted into relative expression 
of transcripts or into RPKM measure by adjusting the expression by effective length (see Supplementary 
Section 1). In Table[2]we compare three different approaches for length normalisation. In the first approach 
(*), the expression is adjusted by the length of a transcript. The second approach (t) uses effective length 
taking into account the paired-end read fragment length distribution and the number of all positions from 
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Method ver. 


Read distribution 


Average 


Rep. 1 


Rep. 2 


Rep. 3 


BitSeq 0.4 


uniform * 


0.7585 


0.7575 


0.7580 


0.7594 


BitSeq 0.4 


uniform ' 


0.7677 


0.7672 


0.7669 


0.7675 


BitSeq 0.4 


bias corrected * 


0.7565 


0.7554 


0.7561 


0.7573 


BitSeq 0.4 


bias corrected < 


0.7652 


0.76495 


0.7647 


0.7652 


BitSeq 0.4 


bias corrected * 


0.8011 


0.8018 


0.7959 


0.8041 


Cufflinks 0.9.3 


uniform 


0.7503 


0.7470 


0.7513 


0.7519 


Cufflinks 0.9.3 


bias corrected 


0.8056 


0.8018 


0.8050 


0.8083 


Cufflinks 1.3.0 


uniform 


0.5331 


0.5130 


0.5336 


0.5477 


Cufflinks 1.3.0 


bias corrected 


0.6842 


0.6858 


0.6917 


0.6446 


RSEM 1.1.14 


uniform 


0.7632 


0.7623 


0.7628 


0.7640 


RSEM 1.1.14 


bias corrected 


0.7633 


0.7623 


0.7628 


0.76409 


MMSEQ 0.9.18 


uniform 


0.7614 


0.76099 


0.7606 


0.7620 



Table 2: Evaluation of transcript expression inference algorithms using the SRA0 12427 RNA-seq data and 
TaqMan qRT-PCR expression measures for 893 matching transcripts. Reported values are Pearson R 2 
correlation coefficient of the 893 transcripts' expression estimates and qRT-PCR results, best correlation 
of a method using averaged expression is highlighted. For each method we present values for average 
expression taken from three replicates as well as for each technical replicate separately. BitSeq was used 
with three different versions of expression length normalisation: * - using actual transcript length, f - 
using effective length accounting for fragment length distribution, I - using effective length accounting for 
fragment length and read distribution bias. 

which a fragment could originate. The last approach (£), which provides best results on this dataset, uses 
effective length computed using the fragment length distribution as well as read distribution bias weights 
(see Equation I12|> . More careful investigation of this process is required, however it is limited by small 
number of RNA-seq datasets with known underlying expression, especially when using paired-end reads. 

B.3 Assessing transcript expression inference using simulated data 

We used simulated dataset of 10M paired-end reads to examine the expression estimation accuracy of Bit- 
Seq and compared it against other three popular methods. The reads were generated based on expression 
estimates from the Xu et al. dataset with a fragment size distribution If ~ LogNorm(5.32, 0.12), inferred 
from the SRA012427 dataset. We used the UCSC NCBI37/hgl9 knownGene annotation transcripts to 
generate the read fragments. First we compared the overall expression accuracy against the generative 
read count values (Figure [5]). We used read count in order to facilitate the second part of our comparison, 
the assessment of the within gene relative expression estimation (Table [3]) , for which the use RPKM would 
not be feasible. 

For comparison of overall expression accuracy, we report the Pearson R 2 correlation coefficient with 
the ground truth. The coefficient was calculated for transcripts with at least one read generated (46841 
transcripts). In this comparison RSEM (R 2 = 0.998) has the highest correlation with MMSeq (R 2 = 0.997) 
and BitSeq {R 2 = 0.995) being closely behind. Unfortunately we again have to report poor results for the 
latest version of Cufflinks (R 2 = 0.307) with the version 0.9.3 still performing worse than the other three 
methods (R 2 = 0.784). 

In the withing gene expression comparison (Table [3j), we used two cutoffs for relevant transcripts. The 
first taking into account transcripts for which their gene has at least 10 reads in the ground truth (45662 
transcripts) and the second considering only transcripts for which the gene has at least 100 reads (33757 
transcripts). BitSeq performs the best for the narrow range of transcripts with RSEM and MMSEQ 
having comparable results. For the less stringent criteria, BitSeq still retains very good correlation with 
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Figure 8: Comparison of expression estimates using 10M simulated pair-end reads with known expression. 
The expression estimates were converted into estimated read counts for each transcript and compared 
against ground truth using Log-Log plot. We calculated Pearson R 2 correlation coefficient for transcripts 
with at least one generated paired-end read. The figures show (a) BitSeq, (b) Cufflinks vO.9.3, (c) RSEM, 
and (d) MMSEQ. 



the ground truth while the performance of the other two methods deteriorates. As we are using the same 
dataset, both versions of Cufflinks provide poor correlation when compared to other three methods. 





BitSeq 


Cufflinks 


Cufflinks 0.9.3 


RSEM 


MMSEQ 


above 10 reads 
above 100 reads 


0.951 
0.964 


0.205 
0.176 


0.739 
0.787 


0.876 
0.945 


0.888 
0.948 



Table 3: The R 2 correlation coefficient of estimated within- gene relative expression and ground truth. The 
correlation was calculated for two groups, first one containing transcripts of genes with at least 10 reads 
and the second one containing transcripts of genes with at least 100 reads according to the ground truth. 
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Figure 9: Comparison of standard deviation of posterior samples within single dataset and combined 
datasets of technical replicates and biological replicates, with log RPKM expression on the x-axis and 
standard deviation of log RPKM expression on y-axis. The standard deviation is a sliding average over 
groups of transcripts with similar expression in order to highlight its dependents on the expression. 



B.4 Biological variance of RNA-seq data 

We used RNA-seq data from the microRNA target identification study ( Xu et al. . 20 id ) to test and com- 
pare the differential expression analysis method used in BitSeq Stage 2. This dataset contains technical 
as well as biological replicates for each studied condition allowing assessment of the effects of biological 
variation. Similarly to previous results ( Anders and Huber . 2O10l : Qshlack et al. . 2010l ). we observe signifi- 
cant biological variation within conditions. Figure [U] shows the standard deviation of transcript expression 
level posterior MCMC samples as a function of the mean expression level of the transcript. We compare 
the standard deviation for samples from within one experiment, between two technical replicates and be- 
tween two biological replicates. In order to calculate the standard deviation between replicates we took 
the squared root of variance which was estimated by computing mean square distance between samples. 
Plotted values are averaged for a sliding window of similarly expressed transcripts. The MCMC sample 
variation captures the intrinsic estimation variance in the "within-experiment" case. The technical vari- 
ance includes a contribution due to re-sequencing the same biological sample while the biological variance 
includes a contribution due to repeating the experiment. 

We see that with higher expression the variation of the expression level estimation decreases as expected. 
At high expression levels the variance associated with technical replicates approaches the level of the within- 
experiment variance. On the other hand, the biological variance becomes relatively more significant in 
this regime. Without consideration of biological differences, high confidence of expression estimation of 
these transcripts will lead to false differential expression calls. It can also be observed that the within- 
experiment variance is a significant contribution to replicate variance (technical and biological) at lower 
expression levels. Therefore the intrinsic variance due to mapping ambiguity and limited read depth, as 
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estimated by our MCMC expression estimation procedure, will provide useful information for assessing 
replicate variance in this low expression regime. 



B.5 Assessing DE performance with simulated data 
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Figure 10: ROC curves averaged over 5 runs with standard deviation depicted by error bars. The curve 
was calculated for transcripts with average of at least one read in the ground truth. The fold change was 
uniformly distributed in the interval (1.5,3.5). 

We carried out extensive assessment of DE analysis accuracy of BitSeq with comparison to other 
methods. The Cuffdiff method from Cufflinks package ( Trapnell et al . 2010l ) is the only other method 
designed for transcript level DE analysis that uses replicates and accounts for biological variation. We 
also included three popu lar meth ods which are primar ily desig ned for gene level DE analy sis (DESeq 
( Anders and Huber , 20ld ). edgeR ( Robinson et all 20ld ). baySeq ( Hardcastle and Kelly . 20ld )). but given 
the lack of other options and their input being only the read count vectors, they could be considered for 
the transcript level analysis use case as well. Using expression estimates obtained by BitSeq Stage 1, 
we converted the relative expression of fragments into read counts by simply multiplying it by the total 
number of aligned reads and used this as an input for the gene-level methods. For each of these methods 
we used default parameter settings according to the packages' vignettes. 

The Figure [10] shows the same ROCs as Figure 6(a) in the main paper without the 0.2 cutoff. The 
evaluation is only for transcripts with at least one generated read on average with fold change being 
uniformly generated from the interval (1.5, 3.5). In this figure, the error bars depict the standard deviation 
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Figure 11: Differential expression analysis of simulated data with various levels of fold change. The figures 
focus on the most relevant region with false positive rate above 0.2, and showing the y-axis up to true 
positive rate 0.65. The sub figures show data with fold change: 1.5 (a), 2.0 (b), 2.5 (c), 3.0 (d) and 5.0 
(e). 



for the averaged curves showing consistent results trough the experiments. We can see that BitSeq performs 
slightly better than the other methods with baySeq having higher true positive range in area with above 
0.4 false positive range, however this area is not interesting from the application perspective. 

In the very last figure (jlip , we compare the accuracy of these methods with respect to the fold change 
of differentially expressed transcripts. We again restrict the figures to the area with false positive rate 
below 0.2 which in our opinion is the most important in terms of applicability. Instead of using randomly 
selected fold change, all differentially expressed transcripts are either up-regulated or down-regulated by 
constant fold change. The increase of fold change clearly improves the performance of the methods as 
we expected. BitSeq and baySeq have consistently better results than the other methods except for the 
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lowest fold change 1.5, in which baySeq has the lowest true positive rate and edgeR with DESeq outperform 
BitSeq in half of the spectrum. 

In all of our DE experiments, Cuffdiff, despite being designed for transcript level analysis performs 
worse out of the 5 compared algorithm. This could be largely attributed to the expression estimation 
problem, however for DE analysis return to the older version (0.9.3) did not improve the results, possibly 
because of different DE model. Our data also shows that for most parts, the DESeq and edgeR methods 
produce very similar results in terms of accuracy. We have to note, that even though we tried to simulate 
the data in way to resemble real RNA-seq experiments, the data proved to be rather hard for all methods 
being compared. 
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